Leveling the Field on Math and Science Tests 
for Students with Learning Disabilities 



Key Concepts 

Standards for Educational 
and Psychological Testing 
(American Educational 
Research Association, American 
Psychological Association, 
& National Council on 
Measurement in Education, 
1999) provides these definitions 
for three concepts that are central 
to the topic of testing students 
with disabilities: 

• Validity — “The degree to 
which accumulated evidence 
and theory support specific 
interpretations of test scores 
entailed by proposed uses of 
a test” (p. 184). 

• Construct — “The concept 
or the characteristic that a 
test is designed to measure” 
(p. 173). 

• Fairness — “In testing, the 
principle that every test taker 
should be assessed in an 
equitable way” (p. 175). 



By Elizabeth Stone and Linda Cook 

If scores on a state math or science assessment have 
been evaluated using data only from test takers with- 
out learning disabilities, can we assume the inferences 
made based on the test’s scores will be fair and valid for all 
students? 

What is a nonstandard test administration? Does admin- 
istering the test under such conditions affect the fairness 
of the test or the validity of the inferences made from the 
test’s scores? 

And how do we know whether test scores reflect the 
same skills and knowledge for different groups of students 
— specifically for students with disabilities? 

For anyone trying to understand the meaning of scores 
on state standards-based achievement tests, these are 
important questions to ask. Over the past few years, ETS 
has carried out studies of state standards-based achievement 
tests that were administered to students with disabilities. 
Specifically, the studies sought evidence about the fairness 
and validity of the inferences made from scores on these 
tests. These studies have considered those students with 
disabilities who took tests under standard conditions and 
those who took the tests under nonstandard conditions. 

In the context of educational measurement, the terms we 
use in this article have meanings that may differ slightly 
from the ways in which they are used in general 
conversation, so it is important to clarify: 

Validity is defined in Standards for Educa- 
tional and Psychological Testing as the “degree 
to which evidence and theory support the inter- 
pretations of test scores entailed by proposed 
uses of tests” (American Educational Research 
Association, American Psychological Associa- 
tion, & National Council on Measurement in 
Education, 1999, p. 9). 

The same publication (AERA, APA, & 
NCME, 1999) gives this general definition of 
fairness: “In testing, the 
principle that every test 
taker should be assessed in 
an equitable way” (p. 175). 

By state assessments 
or state standards-based 
achievement tests, we 
mean the tests that states 
require in order to deter- 
mine whether students and 
districts are meeting the 
educational achievement 
goals established by state 
and federal mandates. 

The validity-related literature that we refer- 
ence gives great importance to the term con- 
struct and uses it frequently. Standards for 
Educational and Psychological Testing defines 
construct as the “concept or the characteristic 
that a test is designed to measure” (AERA, 
APA, & NCME, 1999, p. 173). 

The term accommodation as we use it in this 
article is reserved for test changes that a state 
believes do not alter the underlying construct 
measured by the test. The term modification 
refers to test changes that a state believes may 
alter the underlying construct that the test 
measures. 



Why is validity such an important concept 
in testing? If test scores are to be used to make 
decisions that affect the direction of a test 
taker’s life — for example, to make decisions 
about admission to college or graduate school, 
to allow a student to graduate, or to allow 
an examinee to receive certification — it is 
imperative that the test scores provide infor- 
mation that allows those interpretations to be 
made appropriately. 

While there is a justifiable concern about 
the way test score interpretations affect indi- 
viduals, consider also that test scores are used 
in the K-12 setting for 
accountability purposes 
that lead to evaluations and 
comparisons at the school 
and state levels. Inferences 
made at these levels can 
affect funding, staffing, and 
even curriculum develop- 
ment, and thus it is impor- 
tant to realize that test score 
validity can have more far- 
reaching consequences than 
is immediately apparent. 

As these issues are discussed in this article, 
validity comes down to this question: 
When K-12 students with learning disabilities 
take state standards-based achievement tests in 
math and science, do their scores reflect their 
actual knowledge and skills in these subjects, 
or do they reflect the presence of some other 
factors that are unrelated to math and science? 

More Inclusion 

Research studies have shown that a smaller 
percentage of students with learning dis- 
abilities participate in state assessments than 
do their peers without learning disabilities. 
Furthermore, there is almost always a performance 
gap between these groups of students 
on these assessments. 

www.ets.org 

R&D Connections • No. 12 • December 2009 

It is important to evaluate whether a per- 
formance gap on a state test is truly due to 
differences in proficiency or whether there are 
obstacles irrelevant to what the test is supposed 
to measure that are preventing students with 
disabilities from demonstrating the full extent 
of their knowledge. 

In this article, we discuss some research that 
has taken place involving the issues of partici- 
pation and proficiency. The article also provides 
examples of some of the work that ETS has 
done to examine validity 
and fairness issues related 
to scores on state standards- 
based math and science tests 
administered to students 
with learning disabilities. 

For a long time, students 
with learning disabilities 
were generally instructed 
and tested separately from 
students without learning 
disabilities. The No Child 
Left Behind Act (NCLB) was one step that 
helped to change this situation. NCLB requires 
schools to demonstrate not only the proficiency 
of their total student population, but also the 
performance of different demographic groups 
within that population. Students with disabili- 
ties make up one such accountability subgroup. 

The mandatory inclusion of students with 
disabilities in calculations of adequate yearly 
progress, which K-12 educators often refer to 
as AYP, is an important, but also challenging, 
requirement for schools to meet. 

The number of students with disabilities in 
U.S. public schools is quite large. According 
to findings based on the National Center for 
Education Statistics (NCES) Common Core 
of Data, in the 2003-2004 school year, more 
than 6 million students with disabilities — 
approximately 14% of all students — attended 
U.S. public schools (Cortiella, 2007). 

Of these students with disabilities, about 
46% were classified as having specific learn- 
ing disabilities, which in this case refers to “a 
disorder in 1 or more of the basic psychological 
processes involved in understanding or in using 
language, spoken or written, which disorder 
may manifest itself in the imperfect ability to 
listen, think, speak, read, write, spell, or do 
mathematical calculations” (Individuals with 
Disabilities Education Act 
[IDEA], 1997, 2004). 

This definition is part of a 
body of U.S. legal code that 
has been developed over the 
years to address the edu- 
cational needs of students 
with disabilities. One of the 
most widely cited parts of 
the legal code related to stu- 
dents with disabilities, the 
IDEA (1997, 2004) requires states to provide a 
means for participation in assessments. 

One way to include students with learning 
disabilities is to allow them to take tests under 
nonstandard conditions, using various types of 
testing modifications or testing accommoda- 
tions. Generally, when states feel that nonstan- 
dard testing conditions do not change the test’s 
underlying construct — the knowledge, skills, 
or abilities it targets — they include the results 
of the assessments when they calculate their 
AYP. When states feel that the nonstandard 
testing conditions do change what is being 
tested, they do not include the affected 
students’ scores in the calculation of AYP. 
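
The inclusion rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three condition labels, the `ScoreRecord` type, and the sample scores are hypothetical, and actual state rules vary by state and by assessment.

```python
from dataclasses import dataclass

# Hypothetical labels for how a state classifies a testing condition.
STANDARD = "standard"
ACCOMMODATION = "accommodation"  # state believes the construct is unchanged
MODIFICATION = "modification"    # state believes the construct may be altered


@dataclass
class ScoreRecord:
    student_id: str
    score: float
    condition: str  # one of the three labels above


def counts_toward_ayp(record: ScoreRecord) -> bool:
    """Return True when the score would be included under the rule sketched above."""
    return record.condition in (STANDARD, ACCOMMODATION)


# Made-up records, for illustration only.
records = [
    ScoreRecord("s1", 412.0, STANDARD),
    ScoreRecord("s2", 388.0, ACCOMMODATION),
    ScoreRecord("s3", 365.0, MODIFICATION),
]

included = [r.student_id for r in records if counts_toward_ayp(r)]
print(included)
```

In this sketch the decision depends only on how the state has classified the test change, which is why the same change (a calculator, for example) could be counted in one state and excluded in another.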



Categories of Test Changes 

Cortiella (2005) describes 
examples of four kinds of 
accommodations for students with 
learning disabilities: 

Presentation 

• Large print or braille 
test forms 

• Magnification devices 

• Calculators 

• Arithmetic tables 

• Audio accommodations 

Response 

• Marking answers in the test 
booklet rather than on a 
separate answer sheet 

• Having another individual 
record answers for the 
student 

Timing and scheduling 

• Allowing the student 
extended time to complete 
the test 

• Splitting the test session up 
into multiple sessions 

• Allowing the student to take 
breaks more frequently 
than are usually provided 

Test setting 

• Testing somewhere 
separate from other 
students — for example, at 
home or in a quiet corner 



By way of illustration, providing extended time is often 
considered to be a minor change in testing conditions that 
does not change what is being tested. On the other hand, 
the use of a calculator on a mathematics assessment is 
frequently considered to be a change that does affect what 
is being tested. 

It is important to recognize that the definition of a rea- 
sonable accommodation may vary by state and by assess- 
ment. Cortiella (2005) discusses four types of accommo- 
dations, which we will refer to later. 

Finding Performance Evidence 

It is often difficult to get a good sense of how students 
with disabilities perform on state math and science assess- 
ments when compared with students without disabilities. 
The data necessary to evaluate performance on state 
assessments by any one demographic group — which 
measurement experts often refer to as subgroups — are 
not always readily available. 

Even when subgroup results are available, states often 
report the results of students with learning disabilities as 
part of the larger, more generally defined group of students 
with disabilities. This can complicate the task of 
making inferences about the meaning of scores because 
students with different disabilities can have very different 
experiences when taking the same test. 

Studies that have looked at this sparse evidence have 
suggested that students with disabilities do not perform 
as well on state mathematics and science assessments — 
even when they take versions of the tests with accommodations 
(VanGetson & Thurlow, 2007; National Center for 
Education Statistics, 2006). 

Furthermore, VanGetson and Thurlow (2007) found 
that this achievement gap grows as students get older, so 
that the gap in performance between students with and 
without disabilities is larger in high school than it is in 
elementary school. 

It is important to ask: If accommodations are adminis- 
tered to “level the playing field” (Tindal & Fuchs, 1999), 
do the achievement gaps observed by VanGetson and 
Thurlow, and others, reflect true differences 
in achievement between students with and 
without disabilities? Or, do the observed dif- 
ferences in performance reflect other aspects 
of students’ disabilities that the accommoda- 
tions have not accounted for and that are not 
relevant to the construct (or characteristic) that 
the test is intended to measure? 

Research on Accommodations 

The complicated set of factors involved in 
testing students with various disabilities has led 
to much research into how these students per- 
form on assessments, the effects of accommo- 
dations on their performance, and the validity of 
inferences people make about their test scores. 

There are many types of 
accommodations that can 
be used on math and sci- 
ence assessments. Cortiella 
(2005) provides the four 
categories often used: pre- 
sentation, response, timing 
and scheduling, and setting. 

Presentation accommodations include 
large print or braille test forms, magnification 
devices, calculators, arithmetic tables, and 
audio accommodations such as having a human 
read the test aloud or having an audio-recorded 
reading of the test on tape or CD. The latter 
type is also commonly referred to as a read- 
aloud, audio, or oral accommodation. Presen- 
tation accommodations benefit students who 
have trouble reading test questions or options. 

Response accommodations include marking 
answers in the test booklet rather than on a sep- 
arate answer sheet or having another individual 
record answers for the student. Students who 
are not able to go back and forth between pages 
or who have memory or writing difficulties may 
benefit from this type of accommodation. 



Timing and scheduling accommodations 
include allowing the student extended time to 
complete the test, splitting the test session up 
into multiple sessions, or allowing the student 
to take breaks more frequently than they are 
usually provided. A student who has specific 
types of physical conditions or a student who 
has trouble concentrating for a prolonged 
period of time may benefit from this type of 
accommodation. 

Finally, test setting accommodations involve 
testing somewhere separate from other stu- 
dents, for example at home or in a quiet cor- 
ner. A student who has trouble concentrating 
in large groups may benefit from these types 
of accommodations. 

A goal of using accom- 
modations on assessments 
is to remove obstacles for 
students with disabilities. 
Much of the research into 
assessments for students 
with disabilities focuses on 
how various accommoda- 
tions affect performance or how they change 
what the test measures. 

For instance, students with disabilities may 
need more time to complete a task than stu- 
dents without disabilities need. If the test is not 
intended to measure speed in completing a task, 
then allowing extra time should not affect the 
scores on the test (Sireci, Scarpati, & Li, 2005). 

Time limits usually have a more practical 
function. Studies have shown that, on most 
tests, giving students extra time usually does 
not give them an unfair advantage (Bridgeman, 
McBride, & Monaghan, 2004). Thus, allowing 
extra time should not affect the meaning of test 
scores. 



Validating Tests 

Professionals validate tests 
against various criteria: 

• Does the test cover the 
right content and does it 
cover enough of it? 

• Is there a link between 
the scores and external 
measures of the same or 
a similar construct? 

• When answering test 
questions, do test takers 
use the skills being 
tested? 

• Are the claims made 
about the test borne out 
in the consequences of 
using the scores? 

• Do statistical analyses 
indicate that performance 
may be related to 
membership in a 
particular demographic 
group? 



A similar argument might be made for allowing students to 
mark answers directly in the test booklet or having someone 
help them to record the responses. If the test is not intended 
to measure the ability to transcribe, allowing this type of 
accommodation should not change the meaning of scores 
(Tindal, Heath, Hollenbeck, Almond, & Harniss, 1998). 

A more controversial accommodation is a read-aloud 
accommodation (cf. Bolt & Thurlow, 2006; Bolt & 
Ysseldyke, 2006; Huynh, Meyer, & Gallant, 2004). Most 
of the debate surrounding this accommodation centers on 
whether to allow it for a test of reading skills, but research- 
ers have also explored the use of read-aloud as an appropri- 
ate accommodation for tests that are not intended to mea- 
sure any aspect of reading, such as a math or science test. 

On such tests, where students’ scores are intended to 
reflect their knowledge of math or science, it may make 
sense to consider allowing students with reading-based 
disabilities to use a read-aloud accommodation in order to 
more readily display their skills in math or science. 

Studying Validity and Fairness 

Testing professionals work to ensure that the test scores 
can be used as intended by engaging in validation practices 
that usually involve evaluating the test against various cri- 
teria. Standards for Educational and Psychological Testing 
calls upon test developers and researchers to examine the 
following sources for evidence of validity (AERA, APA, & 
NCME, 1999, pp. 11-15): 

• Test content — Does the test cover the right content 
and does it cover enough of it? To establish this, 
content or subject matter experts develop and review 
the test questions. 

• Relationship of test scores with other variables — Is 
there a link between the scores on an assessment and 
external measures of the same or similar construct? 
For example, if students who perform well on a col- 
lege admissions test also perform well in college, this 
may lend credibility to the predictive claims of the 
admissions test. 






• Evidence of student response processes 
— In attempting to answer a test ques- 
tion, do test takers use the skills being 
tested? This kind of validity research 
can be performed, for example, through 
the use of think-aloud or cognitive lab 
procedures that involve listening to test 
takers talk about what they are thinking 
while they ponder test questions. 

• The consequences of assessments — Are 
the claims made about the test borne out 
in the consequences of using the test 
scores in decision making? For example, 
if the test is supposed to be able to 
distinguish between those who will and 
those who will not be successful in a 
particular job, the subsequent job per- 
formance of candidates who were hired 
may provide validity evidence. 

• Investigation of internal structure of 
assessments — Do statistical analyses of 
the test results indicate that test takers’ 
performance on the test may be related 
to their membership in a particular 
demographic group (e.g., students with 
disabilities)? Such investigations may be 
performed using statistical procedures 
known as factor analysis and differential 
item functioning analysis. 

This last bullet describes an important part 
of ETS’s work in this area. 
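
The second source of evidence listed above, the relationship of test scores with other variables, is often summarized with a correlation coefficient. The sketch below computes a Pearson correlation between test scores and an external measure. The function name and the sample numbers are hypothetical and chosen only for illustration; a real validity study would involve far more data and additional checks.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cross / math.sqrt(ss_x * ss_y)


# Made-up data: admissions test scores and later first-year grade averages.
admission_scores = [480, 520, 550, 600, 640, 700]
first_year_gpa = [2.4, 2.9, 2.7, 3.1, 3.4, 3.6]

r = pearson_r(admission_scores, first_year_gpa)
print(round(r, 2))
```

A strong positive value would be consistent with the predictive claim in the college admissions example, although a correlation on its own does not establish that scores are equally meaningful for every group of test takers.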

Powerful Tools 

So what is factor analysis? And what is differential 
item functioning analysis (also known 
as DIF analysis)? 

Factor analysis allows researchers to identify 
underlying dimensions, or factors, that 
a test measures. The general question asked 
in the ETS studies that use factor analysis is 
this: Do the test’s questions measure only one 
dimension, such as reading ability, or more 
than one dimension, such as understanding 
vocabulary and reading comprehension? 

Factor analysis also can be used to investigate 
whether a test has the same underlying 
set of dimensions when it is given to different 
groups of students. For further details on factor 
analysis and how it is used in validity research, 
readers may consult Kane (2006). 
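
As a rough illustration of the one-dimension-or-more question, the sketch below looks at how much of the total variance in a set of items is captured by the largest eigenvalue of their correlation matrix. This is a simplified stand-in based on principal components, not a full factor analysis and not the procedure used in the ETS studies. Both correlation matrices are hypothetical, and the power-iteration routine assumes mostly positive correlations among items.

```python
import math


def largest_eigenvalue(matrix, iterations=500):
    """Estimate the largest eigenvalue of a symmetric matrix by power iteration."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n  # start from a uniform unit vector
    for _ in range(iterations):
        product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(value * value for value in product))
        vector = [value / norm for value in product]
    # The Rayleigh quotient of the converged unit vector gives the eigenvalue.
    product = [sum(row[j] * vector[j] for j in range(n)) for row in matrix]
    return sum(vector[i] * product[i] for i in range(n))


def first_dimension_share(correlations):
    """Share of total variance associated with the largest eigenvalue."""
    return largest_eigenvalue(correlations) / len(correlations)


# Hypothetical case 1: four items that all correlate 0.6 with one another.
one_dimension = [
    [1.0, 0.6, 0.6, 0.6],
    [0.6, 1.0, 0.6, 0.6],
    [0.6, 0.6, 1.0, 0.6],
    [0.6, 0.6, 0.6, 1.0],
]

# Hypothetical case 2: two pairs of items that correlate strongly within
# a pair (0.6) but only weakly across pairs (0.1).
two_dimensions = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
]

share_one = first_dimension_share(one_dimension)
share_two = first_dimension_share(two_dimensions)
print(round(share_one, 2), round(share_two, 2))
```

In the first matrix a single dimension accounts for about 70% of the variance, while in the second it accounts for about 45%, which hints that a second dimension is needed to describe the items.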

Differential item functioning (DIF) analysis 
helps to identify test questions that may be 
functioning differently for groups of test takers 
who have the same level of proficiency. Most 
DIF analyses compare individuals from differ- 
ent gender, race, or ethnic groups, but they can 
also be used to compare groups with different 
disability status. The assumption is that test 
takers of equal ability should have the same 
chance of correctly answering a test question 
regardless of the group they belong to. 

An item, or test question, is said to show 
DIF if, once groups are matched on a mea- 
sure of ability, one group gets a question 
right significantly more often than the other 
group does. This is just one simplified way to 
describe a common method, though the under- 
lying idea holds for most DIF procedures. 
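
One widely used way to put this idea into practice is the Mantel-Haenszel procedure, sketched below for a single test question. This is an illustration rather than a description of the exact method used in the studies discussed here. Test takers are first sorted into strata by a matching score, and within each stratum the counts of right and wrong answers are compared for a reference group and a focal group. The counts are hypothetical, and a real analysis would also test whether the result is statistically significant.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across matched strata.

    Each stratum is a tuple of counts:
    (reference_right, reference_wrong, focal_right, focal_wrong).
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum carries no information
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_d_dif(odds_ratio):
    """Express the odds ratio on the delta scale often used to report DIF.

    Negative values mean the question was harder for the focal group
    than for matched members of the reference group.
    """
    return -2.35 * math.log(odds_ratio)


# Made-up counts for one question in three matched score bands
# (low, middle, high).
item_counts = [
    (30, 70, 12, 38),
    (60, 40, 25, 25),
    (85, 15, 40, 10),
]

alpha = mantel_haenszel_odds_ratio(item_counts)
d_dif = mh_d_dif(alpha)
print(round(alpha, 2), round(d_dif, 2))
```

Because the comparison is made within strata of similar ability, an overall difference in proficiency between the two groups does not by itself produce DIF. Only a difference that remains after matching does.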

If a question appears to show at least a 
moderate level of DIF, the question comes 
under review by specially trained subject matter 
experts. Depending on the result of that review, 
the question may be left as is, revised, or 
deleted from the pool of possible test questions. 
Testing officials may also decide not to include 
the question when determining test scores. 
Interested readers may consider Holland and 
Wainer (1993) for more information on DIF. 

Taken together, the procedures of factor 
analysis and DIF are powerful analytical tools 
to help determine if a test is measuring the 
same thing for all students — both those with 
and without learning disabilities. This question 
is particularly important for anyone interested 
in interpreting scores from tests that states use 
for accountability purposes. 

Comparing Groups 

States rightly do not want to combine, for 
the purpose of accountability, the scores from 
different groups if the test is measuring differ- 
ent knowledge or skills for different groups. 
Doing so would provide an inaccurate picture 
of the combined group’s proficiency. 

Researchers at ETS have carried out studies 
that used DIF analysis and factor analysis to 
examine the fairness and validity of fifth grade 
math and science tests for students with learn- 
ing disabilities. The goal of both the factor 
analysis and the DIF studies was to explore 
whether or not the assessments were measur- 
ing the same underlying skills for the groups 
defined as follows: 

• Students without disabilities taking the 
test under standard conditions 

• Students with learning disabilities taking 
the test under standard conditions — i.e., 
without accommodations or modifica- 
tions (see the earlier discussion of the 
distinction between accommodations 
and modifications and what testing 
changes states typically consider to be in 
each category) 

• Students with learning disabilities taking 
the test with accommodations 

• Students with learning disabilities taking 
the test with modifications 

The results of the factor analysis of the 
math test indicated that the test measured three 
related factors that appeared to be common for 
all of the groups (students without learning 
disabilities and students with learning disabilities 
testing under various conditions) listed above. 
The results of the factor analysis of the science 
test showed that the test measured a single factor 
that appeared to be common for all of the 
same groups. 
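
One simple index of whether a factor looks the same in two groups is Tucker's congruence coefficient, which compares the factor loadings estimated separately for each group. The source does not say which index the ETS studies relied on, so the sketch below is offered only as an illustration of the general idea. The loadings are hypothetical.

```python
import math


def congruence_coefficient(loadings_a, loadings_b):
    """Tucker's congruence coefficient between two sets of factor loadings."""
    if len(loadings_a) != len(loadings_b) or not loadings_a:
        raise ValueError("need two non-empty loading lists of equal length")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    scale = math.sqrt(
        sum(a * a for a in loadings_a) * sum(b * b for b in loadings_b)
    )
    if scale == 0:
        raise ValueError("coefficient is undefined when all loadings are zero")
    return cross / scale


# Made-up loadings of five items on one factor, estimated separately
# for two groups of test takers.
standard_group = [0.70, 0.65, 0.60, 0.72, 0.58]
accommodated_group = [0.68, 0.66, 0.55, 0.70, 0.61]

phi = congruence_coefficient(standard_group, accommodated_group)
print(round(phi, 3))
```

Values close to 1 suggest that the items relate to the factor in much the same way in both groups, which is the kind of pattern the phrase "common for all of the groups" describes.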

The DIF analyses of the science test yielded 
no evidence that the items were functioning 
differently for the groups described earlier. 
Some DIF was detected on the math test in the 
comparison involving students with disabilities 
using math modifications. This result may be 
of interest for more focused research. 

However, neither test showed large amounts 
of DIF for any of the groups studied. The 
interested reader can find details of the fac- 
tor analyses for the math and science tests in 
Cook, Eignor, Steinberg, and Sawaki (2008), 
and Steinberg, Cline, and Sawaki (2008), 
respectively. The results of the DIF analysis 
for the math and science tests are presented in 
Cline, Cook, and Stone (2008). 

Summary 

The goal of presenting a level playing field 
for standardized testing is one of great con- 
sequence. Information gained from research 
studies such as those mentioned here may be 
used during test development to design tests 
that are more accessible for all students. 

In addition, knowledge obtained through this 
research may lead to new and more effective 
accommodations that will allow students with 
disabilities to demonstrate their proficiency 
without undue impediment. At ETS, these 
studies form part of a comprehensive program 
of research that is integral to the organization’s 
mission to provide assessments that are of high 
quality and are fair for all students. 